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1. Introduction 

The solution of large sparse systems of linear equations 

Ax = b (1.1) 

is one of the most frequently encountered tasks in numerical computations. For exam- 
ple, such systems arise from finite difference or finite element approximations to partial 
differential equations. For Hermitian positive definite coefficient matrices A, the classical 
conjugate gradient method (CG hereafter) of Hestenes and Stiefel [11] is one of the most 
powerful iterative schemes for solving (1.1), especially when combined with a precondi- 
tioning technique. For general non- Hermitian matrices, the situation is less satisfactory. 
An ideal CG-type method for solving non-Hermitian linear systems would have features 
similar to the classical CG algorithm. It would produce approximate solutions to (1.1) 
which: 

(i) are characterized by a minimization property over Krylov subspaces generated by A\ 

(ii) can be computed with little work and low storage requirements per iteration. 

Unfortunately, for general non-Hermitian matrices, there are no CG-type algorithms which 
fulfill both requirements (i) and (ii); this was proved by Faber and Manteuffel [2]. Instead, 
most CG-type methods for non-Hermitian linear systems satisfy either (i) or (ii). 

In the first category, the most successful scheme is the generalized minimal residual 
algorithm (GMRES) by Saad and Schultz [21]. It fulfills (i), but not (ii), since work and 
storage per iteration grow linearly with the iteration number. Consequently, in practice, 
one cannot afford to run the full algorithm and it is necessary to use restarts. For difficult 
problems, this often results in very slow convergence. 

In the second category, the archetype is the bi conjugate gradient algorithm (BCG 
hereafter) due to Lanczos [13]. At least in the generic case, BCG is based on simple three- 
term recurrences, which keep work and storage requirements constant at each iteration. 
However, the BCG iterates are defined by a Galerkin condition rather than a minimization 
property (i), which means that the algorithm can exhibit — and typically does — a rather 
irregular convergence behavior with wild oscillations in the residual norm. Furthermore, 
in the BCG algorithm, breakdowns — more precisely, division by 0 — may occur. In finite 
precision arithmetic, such exact breakdowns are very unlikely; however, near-breakdowns 
may occur, leading to numerical instabilities in subsequent iterations. Recently, two mod- 
ifications of BCG, namely CGS [22] and Bi-CGSTAB [24], have been proposed. However, 
while these methods seem to work well in many cases, they do not address the problem 
of breakdowns, and thus they too, like BCG, are susceptible to instabilities. In exact 
arithmetic, both CGS and Bi-CGSTAB break down every time BCG does. 

In this paper, we present a novel BCG-like approach for general nonsingular non- 
Hermitian linear systems (1.1), the quasi-minimal residual algorithm (QMR hereafter), 
which overcomes the problems of BCG. The method uses a look-ahead variant of the 
nonsymmetric Lanczos process to generate basis vectors for the Krylov subspaces induced 
by A. The look-ahead Lanczos approach was first proposed by Taylor [23] and Parlett, 
Taylor, and Liu [18]. For the QMR method, we use the implementation of the look-ahead 
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Lanczos process recently developed by Freund, Gutknecht, and Nachtigal [6, 7]. Using the 
Lanczos basis, the actual QMR iterates are then defined by a relaxed version of (i), namely 
a quasi-mi nimal residual property. This approach was first proposed by Freund [5] for the 

special case of linear systems with complex symmetric coefficient matrices A = A T . 

The QMR method can be implemented using only short recurrences and hence it still 
satisfies the requirement (ii). The quasi-minimal residual property ensures that QMR, 
unlike BCG, converges smoothly; moreover, existing BCG iterates can also be easily and 
stably recovered from the QMR process. Finally, for the QMR method, it is possible to 
obtain error bounds which are essentially the same as the standard bounds for GMRES. 
To the best of our knowledge, this is the first convergence result for a BCG-like algorithm. 

The outline of this paper is as follows. In Section 2, the implementation of the look- 
ahead Lanczos algorithm derived in [6, 7] is briefly outlined. In Section 3, we describe 
the basic idea of the QMR approach. In Section 4, details of an implementation of the 
QMR algorithm axe given. In Section 5, we discuss the connection of QMR with BCG. 
In Section 6, a convergence theorem for QMR is derived. In Section 7, we show how to 
incorporate preconditioning into the QMR method and describe two preconditioners which 
we have used for our numerical tests. In Section 8, we present numerical examples. Finally, 
in Section 9, we make some concluding remarks. 

Throughout the paper, all vectors and matrices, unless otherwise stated, are assumed 
to be complex. As usual, M T — ( rriji ) and M H = (rriji) denote the transpose and 
the conjugate transpose, respectively, of the matrix M = (m jj). We use c max (M) and 
a m i D (M) for the largest and smallest singular value of M, respectively. The vector norm 

||a:|| = Vx H x is always the Euclidean norm and ||M|| = cr mKX (M) is the corresponding 
matrix norm. The set of eigenvalues of a square matrix M is denoted by A (M). We use 
the notation 

K n (c, B) := span {c, Be , . . . , B n ~ l c) 

for the nth Krylov subspace of C generated by c 6 and the N x N matrix B. 
Furthermore, it is always assumed that A is a complex, in general non-Hermitian, N x N 
matrix. 

Finally, one more note. In our formulations of BCG and of the nonsymmetric Lanczos 

algorithm, we use A T rather than A H . This was a deliberate choice in order to avoid 
complex conjugation of the scalars in the recurrences; the algorithms can be formulated 
equally well in either terms. 
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2. The Look-Ahead Lanczos Algorithm 

Given two nonzero starting vectors tq £ C N and w\ £ C^, the classical nonsymmetric 
Lanczos method [12], [25, pp. 388-394] generates two sequences of vectors i>i, i>2> ■ ■ . , v n 
and Wi, W 2 , ■ • • , w n , n — 1,2,..., such that 


span {ui , i>2j ■ ■ • 5 v n} = A'„(ui,A), 

span{u>i,u> 2 ,..- ,^n} = K n (wi,A T ), 


( 2 . 1 ) 


and 


f 0 if j ± 

W ■ V, = \ 

[dj^O if j - 1 

The actual construction of each new pair v n +i and w n +i 
two simple three-term recurrences. If 


( 2 . 2 ) 

of Lanczos vectors is based on 


Wn+lV n+ l = 0, (2.3) 

the Lanczos algorithm needs to be terminated, since (2.3) would lead to a division by 0 at 
the next iteration. In exact arithmetic, (2.3) will occur after a finite number, m = n (< TV), 
of steps. If (2.3) is caused by t> m+ i = 0 or u; m +i = 0, then K m {y u A) is ^-invariant or 

K m (wi,A T ) is A T -invariant, respectively, and, by (2.1), the Lanczos process has con- 
structed a basis for this invariant subspace. This is referred to as regular termination. 
Unfortunately, it can also happen that (2.3) is satisfied with u m +i ^ 0 and u> m +i 7^ 0. In 
this case, the Lanczos algorithm stops before an invariant subspace has been found. This 
is referred to as serious breakdown [25, p. 389]. 

It is the possibility of serious breakdowns, or, in finite precision arithmetic, of near- 
breakdowns , i.e. 

w% +1 v n+1 ps 0, 9j ^n+l 9^ 9) 

that has brought the classical nonsymmetric Lanczos algorithm into discredit. However, 
there are so-called look-ahead [23, 18] variants of the Lanczos process which allow to skip 
— except in the very special case of an incurable breakdown [23] — over those iterations 
in which the standard algorithm would break down. We refer the reader to [9, 10, 17] 
for a detailed theory of the look-ahead Lanczos process. In [6, 7], we have developed a 
robust implementation of the look-ahead Lanczos algorithm, which we briefly sketch here; 
for details and the actual FORTRAN code, see [6, 7]. 

Like the classical process, the look-ahead Lanczos algorithm generates two sequences of 
vectors v \ , v? , . . . , v n and w\ , u>2, ■ ■ ■ 1 n = 1,2,..., which satisfy (2.1). Possible break- 
downs are prevented by relaxing the biorthogonality condition (2.2) whenever a breakdown 
(exact or near) would occur. More precisely, for each fixed n = 1,2, . . ., the Lanczos vec- 
tors v, , . . . , v n and w\ , . . . , w n generated by the look-ahead algorithm can be grouped into 
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k — k(n ) blocks 


Vj — [ v ni Um+i • • ■ fni +l -i ] ? 
V k — [u„* v„ k+1 • • • v n ] , 


Wi = [w nt 

W n , + 1 W ni+l 

1 = 1,2,. 

. . , A? — 1, 

W k = [w n „ 

W nk + l ' • • Wn], 


(2.4) 


where 


1 = ni < «2 < ■ • • < nj < • • • < n/t < n < n*+i . 


The blocks axe constructed such that, instead of (2.2), we have 



V t 


f 0 if j # /, 

l A if 3 = 1 , 


jj — 1,2, ... ,i, 


(2.5) 


where 

D\ is nonsingular, / = 1, 2, . . . ,k — 1, and D k is nonsingular if n = n k + 1 — 1. (2.6) 

In the sequel, we denote by 

hi — n/, 1 — 1, 2, . . . , k 1, h k n 

the number of vectors in each block. The first vectors v„, and w n , in each block axe 
called regular , the remaining vectors are called inner. The Arth block is called complete if 
n — njt+i — 1; in this case, at the next step n + 1, a new block is started with the regular 
vectors v nk+l and u; nt+1 . Otherwise, if n < n k +i — 1, the Arth block is incomplete and at 

the next step, the Lanczos vectors u n +i and u> n +i are added to the Arth block as inner 
vectors. 

With these preliminaries, the basic structure of the look-ahead Lanczos algorithm is 
as follows. 

Algorithm 2.1 (Sketch of the look-ahead Lanczos process). 

0) Choose t>i, u>i £ C N with ||i>i|| = ||u>i|| = 1; 

Set V 1 =v l ,W l =w l ,D 1 = W?V l ; 

Set m = 1, A: = 1, t> 0 = u?o = 0, Vo = W 0 = 0, p\ = 6 = 1; 

For n = 1,2, .. . : 

1) Decide whether to construct v n +i and u>„+i as regular or inner vectors 
and go to 2) or 3), respectively; 

2) (Regular step.) Compute 


Vn+i = Au n - V k D~ k l Wl - V^D^W^Av n , 
w n +i - A T w n - W k D k T V k A T w n - W k _ 1 D k ^ 1 V k i 1 A T w n 
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set n*+i = n + 1, k = k + 1, V k = W k = 0, and go to 4); 

3) (Inner step.) Compute 

= ~ (n — rik v n ~ { r ln — ni c /pn) v n — \ l WfcL j Au n , 

^n+1 A W n C n — nk w n { r ln—ni c /^n) w n — l — — l 1 A W n ] 


4) Compute p n+1 = ||u„ +: || and £ n+ i = ||t5 n+1 ||; 
if Pn+ 1 = 0 or £ n+ i = 0, stop; 

Otherwise, set 


V n +1 = Vn+l/Pn+l, W n+1 = tr n+1 /^ n+1 , 

Vk = [V k » B+ i], W* = [W* w n +i ] , D k =WjV k . 


(2.9) 


If only regular steps 2) are performed, all blocks have size hi = 1 and Algorithm 2.1 reduces 
to the classical Lanczos process. Therefore, the strategy for the decision in step 1) should 
be such that regular steps axe performed whenever possible and blocks of size hi > 1 are 
built only to avoid exact or near-breakdowns. In [6], we proposed a practical procedure 
for the decision in step 1) based on three different checks. Clearly, for a regular step, it 
is necessary that D k is nonsingular. Therefore, one of the checks monitors the relative 
size of 0 m in(D k ). The other two checks monitor the size of the components along the two 
previous blocks of vectors V k and F*_i respectively W k and W k - \ in (2.7). A regular step 

is performed only if these terms do not dominate the components Av n respectively A T w n 
in the new Krylov subspaces. These checks are necessary to ensure linear independence of 
the Lanczos vectors. We refer the reader to [6] for details. 

At each step, Algorithm 2.1 requires the computation of Euclidean norms of two 
vectors, v n +\ and w n + 1 , of length N. In addition, inner products of vectors of length N 

occur in the terms W k-1 Av n , V k '_ 1 Aw n , W k 'V k , and, if a regular step is performed, in 

W k Av n and V k Aw n . However, as shown in [6], all these inner products can be generated 
stably based on the actual computation of only two inner products per iteration. Therefore, 
Algorithm 2.1 requires the same number of inner products per iteration as the standard 
nonsymmetric Lanczos process. 

Note that, in (2.8), the inner recurrence coefficients (j, r; } , j = 0, 1, ... , rjo = 0, are 

still arbitrary. For example, Chebyshev iteration [14] would be a possible choice for these 
coefficients. On the other hand, in practice, the blocks built by the look-ahead Lanczos 
algorithm are usually small and the choice (j = rjj = 0 is feasible. 

Next, we list some properties of Algorithm 2.1 which will be used in the sequel. First, 
in view of (2.9), we have 


= liu>. 


= 1, n = 1,2, . . . . 


( 2 . 10 ) 
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It is convenient to introduce the notation 


y (n) = [V! U 2 ... v n ] (= [VI V 2 ... V*]), 
W (n) = [ti>! 1 02 ... w n ] (= [Wi W 2 ... Wfc]). 



* * 0 0 *" 

Pni+i * *• : • 0 0 p n , 

0 Pm + 2 ’• '• 0 : 7i= : 0 . (2.15) 

■ : : 

! o ••• o . 

0 ' ' ' ' ' ' 0 — 1 *- 

The blocks j3i are in general full matrices. Furthermore, for / = 1, . . . , k — 1, the matrices 
aj, /3/, and 7/ are of size h t x h/, h ( - 1 x hi, and h t x h,-i , respectively. The matrices a k , (3 k , 

and 7 k corresponding to the current block k are of size h k x h k , h k -i x h k , and h k x h k - 1, 

respectively. Here h k = h k if the kth block is complete. 

In view of (2.14) and (2.15), if (n) is an upper Hessenberg matrix with subdiagonal 
elements 

Pj> 0, j = 2, 3, . . . , n. (2.16) 

In exact arithmetic, the stopping criterion in step 4) of Algorithm 2.1 will be satisfied 
after at most N steps — except in a very special situation. Recall that in order to close the 
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current block k, it is necessary that WjfVh is nonsingular. However, in general, it cannot be 
excluded that Algorithm 2.1 produces infinite blocks V* and W* of nonzero Lanczos vectors 

such that WjfV k is the infinite zero matrix. This is called an incurable breakdown [23]; 
such a breakdown is very rare and does not present a problem in practice. Furthermore, 
even in the case of an incurable breakdown, the look-ahead Lanczos process still yields 
information on the spectrum of A, as Taylor [23] showed in his Mismatch Theorem (see 
also [9, 17]). For later use, we summarize the termination properties of the look-ahead 
Lanczos process in the following 

Proposition 2.2. There is a “termination index” m < N such that, in exact arithmetic, 
Algorithm 2.1 will either stop in step n = m with p m +i = 0 or = 0, or, starting with 
the regular vectors u m +i and w m +i, an incurable breakdown will occur. If p m + i = 0 or 
£ m+1 = 0, then vi,...,v m or Wi , ... , w m span the A-invariant subspace K m (v i , A) or the 

A T -invariant subspace K m (w\, A T ), respectively. Moreover, in all cases, 

A (ff (m) ) C A(A). (2.17) 
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3. The Quasi-Minimal Residual Approach 

In this section, we describe the basic idea of the QMR approach for the solution of linear 
systems (1.1). From now on, it is always assumed that A is nonsingular. 

Given any initial guess xq £ C N for the exact solution A —1 b of (1.1), we will construct 
iterates x„, n = 1,2,..., such that 


x n £ x 0 + K n (r 0 ,A). (3.1) 

Let r n = b — Ax n denote the residual vector corresponding to the nth iterate x n , and set 

Pa = |M|, vi - r 0 /po. (3.2) 

Let vi,V 2 ,. . . ,v n be the right Lanczos vectors generated by Algorithm 2.1, with the nor- 
malized initial residual vi as one of the two starting vectors. From (2.12), the u’s span 
Kn(ro,A), and hence we have the parametrization 

Xn = X 0 + F (n) z, z £ C n , (3.3) 

for all possible iterates (3.1). Note that the second starting vector, w\ £ C N , is still 
unspecified. Due to the lack of a criterion for the choice of w\ , one usually sets wj = v\ in 
practice. 

Next, letting 


H < n > 

Pn+ l(ei,”>) T 


*!■“> = [o 


0 l] T eR", 


:= 

we can rewrite (2.13) as 

Ay (") — 

From (3.2) and (3.5), the residual vectors corresponding to (3.3) satisfy 

r n = r 0 - AV™z = r 0 - y("+Utf (") z = F (n+1) (p 0 e[ n+1) - i? e (n) z) , 


(3.4) 


(3.5) 

(3.6) 


where ej” +1 ^ = [1 0 • • • O] 7 ^ 6 R n+1 . Next, we introduce an (n + 1) x (n + 1) diagonal 

weight matrix 


fi (n) = diag(w 1 ,w 2 ,...,a> B+1 ), Uj > 0, ; = l,...,n + l, (3.7) 

to serve as a free parameter that can be used to modify the scaling of the problem. With 
it, (3.6) reads 

r n = V( n+1 > ( « (n) )~ 1 ft (n) (poe ( ! n+1) - H[ n) z^ 

= V in+1) (fl (n) ) _1 (d (n) - n (n) H< n) z) , with d (n) = u lPo e[ n+1) . (3.8) 
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Ideally, we would like to choose z (E C” in (3.8) such that ||r n || is minimal. However, since 

in general is not unitary, this would require 0(Nn 2 ) work, which is too expensive. 

We will instead minimize just the Euclidean norm of the bracketed terms in (3.8), i. e., we 

will choose z = 6 C n as the solution of the least squares problem 


d {n) - ft (n) H[ n) z (n) 


min 
z G C" 


d (n) - rt n) Hl n) z . 


(3.9) 


By (2.16), (3.4), and (3.7), H[ n) and are (n + 1) x n matrices with full column 

rank n. This guarantees that the solution z^ of (3.9) is unique and hence, via (3.3), 
defines a unique nth iterate x n . In view of the minimization property (3.9), we refer 
to this iteration scheme as the quasi- minimal residual (QMR) method. Clearly, the QMR 
iterates still depend on the choice of the weights u>j in (3.7). In our numerical experiments, 
the simplest scaling 

= 1, j = 1,2,..., (3.10) 

gave satisfactory results. Recall from (2.10) that all the columns of are unit vectors. 

Hence, the scaling (3.10) ensures that all basis vectors Vj/u)j, j = l,...,n + 1, in the 
representation (3.8) of r n have the same Euclidean length; this is a “natural’ 5 requirement. 
However, better strategies for choosing might be possible, and therefore we have 

formulated the QMR approach with a general scaling matrix f2 (n) . 

For the solution of the least squares problem (3.9), we use the standard approach (see, 

e.#., [8, Chapter 6]) based on a QR decomposition of 


= (<3 <n) ) W R q" 


(3.11) 


Here, Q (n) is a unitary (n + 1) x (n + 1) matrix, and R (n) is a nonsingular upper triangular 
n x n matrix. Inserting (3.11) in (3.9) yields 

|(Q (n) ) // (<? (n) d (n) - ^ 


mm 

z ec n 


d (n) - ft (n) tf< n) z 


= mm 

z e C" 


o 


= mm 
z € C n 


Q(")<f( n ) - 




0 

Z 


Hence, z ^ is given by 
z (n) = (i? (n) ) 
Furthermore, we have 


^ t^ n \ where t ^ = 

'Ti ' 

, 

' t {n) ' 

J 





. T n - 




= g (n) d (n) . 


d( n) - ft (n) tf £ (n) z (n) =|f B+1 |. 


( 3 . 12 ) 


(3.13) 


We conclude this section by summarizing the basic structure of the QMR algorithm. 
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Algorithm 3.1 (QMR algorithm). 

0) Choose x 0 € C N and set r 0 = b — Ax 0 , p 0 = ||r 0 ||, Vi = r 0 /po/ 

Choose wi € C N with ||iwi|| = 1; 

For n = 1,2,... : 

1 ) Perform the nth iteration of the look-ahead Lanczos Algorithm 2.1; 
This yields matrices V^ n \ y( n+1 ) ) which satisfy (3.5); 

2) Update the QR factorization (3.11) offl^H^ and the vector i. 

3) Compute 

x n = z 0 + V (n) (i? (n) )~\ (n) ; 

4) If x„ has converged, stop. 


( 3 . 12 ); 

(3.14) 


10 


4. Implementation Details 

In this section, we give some of the details for the actual implementation of steps 2), 3), 
and 4) of the QMR Algorithm 3.1. In particular, it is shown that the QMR iterates x„ can 
be computed with short recurrences. This approach for updating the iterates x n is based 
on a technique which was first used by Paige and Saunders [16] in connection with their 
SYMMLQ and MINRES algorithms for real symmetric matrices. 

First, note that the QR decomposition (3.11) of can be computed by means 

of n Givens rotations, talcing advantage of the fact that ^ is an upper Hessenberg 

matrix. Hence, the unitary factor in (3.11) is of the form 


Q (n) = G 


3 

\ 

t— » 

O' 


r— ( 

1 

O 

0 

1 


1 

O 

1 


(4.1) 


where, for j = 1, 2, . . . , n, 






with Cj 6 R, Sj G C, c 2 + M 2 — 1. 


Recall that, in view of (2.14), Q^H ( e n) is block tridiagonal. Therefore, the upper triangular 
factor in (3.11) is of the form 


’d)l £2 $3 0 

0 62 


R (n) = 



0 

0 

0 k 


0 


ejt 

0 8k. 


(4.2) 


where the blocks 81 and e/ are of the same size as the blocks a/ and /?/, respectively, in (2.14). 
Moreover, the diagonal blocks 81 are nonsingular upper triangular matrices. Clearly, a QR 
decomposition based on unitary matrices (4.1) limits fill-in to the row above each block 81 
in (2.14). Hence each of the blocks 0 t in (4.2) has possible nonzero entries only in its last 

row. 

Next, we note that the decomposition (3.11) is easily updated from the factorization 

of at the previous step n — 1. Indeed, to obtain one only needs to 

compute its last column, 

[w ^„] I ' = R ( ” ) eS." ) , ( 4 -3) 
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and append it to This is done by first multiplying the last column of f 

by the previous Givens rotations; by (2.14), this last column has zero entries in positions 
1,2, . . . ,nc where 


n G = 


raax(njt_i — 1, 1) if v n is an inner vector, 

max(njt _2 — 1, 1) if v n is a regular vector. 


Therefore, only the Givens rotations with indices ng, no + 1, . . . , n — 1 have to be applied, 
and, by setting 


Hi 


H 
v J 


• 


r g„_i oi 


[ G n -2 01 


Hn-1 


i 

o 
1 


1 

r—f 

o 



G 


n<; 


0 


0 

■fnc — 1 




(4.4) 


we obtain the desired vector (4.3) up to its last component fj, n . It remains to multiply (4.4) 
by a suitably chosen Givens rotation G n which zeros out the last element v = u; n +i/ 9 n+1 . 
To achieve this, set 


Cn = 


H 


1 / 


, 

H 


if H ^ 0, 


+ M 2 

C„ = 0, 5^=1, if p = 0, 

and finally one gets ji n = c n /j. + s n v. For later use, we notice that 


(4.5) 


l-Sn^nl — W n+ i/) n+ i, 


(4.6) 


which is readily verified using (4.5). The vector in (3.12) is updated by setting 

(<"> = Gn r 1 '"” 1 ’ 


Clearly, differs from only in its last two entries which are given by 

r n = c n T n and f n+1 = -s^ f„. (4.7) 

Next, we turn to the computation of the QMR iterates x n in (3.14). We define vectors 
Pj via 

P (n) = [pi P2 ... = (4.8) 
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Then, with (3.14) and (3.12), it follows that 

= — 1 4 “ Priori' 

It remains to show how to compute p„. In analogy to the partitioning of I /(n) in (2.4) and 
(2.11), we group the columns of P ^ into blocks 

P<") = [Pi P 2 ... Pc]. ( 4 - 9 ) 

With (4.8), (4.2), and (4.9), one obtains the relation 


P k = (V k - P k - ie* - Pk- 2 Sk)S k \ 


and thus p n can be updated via short recurrences. 

Finally, for step 4) of Algorithm 3.1, a convergence criterion is needed. We stop the 
QMR iteration as soon as 

||rn||<toH|r 0 ||; ( 4 ' 10 ) 

here tol is a suitable tolerance, e.g. tol = 10 -6 . In the QMR algorithm described so far, 
neither the residual vectors r„ nor their norms ||r n || are generated explicitly. However, in 
part a) of the next proposition, we derive an upper bound for ||r„|| which is available at 
no extra cost. In our implementation, the convergence criterion is checked for this upper 
bound, (4.11), rather than ||r n ||. Once this test is satisfied, we switch to checking (4.10) 
for the true residual norm |jr„||. Typically, this is necessary only in the last one or two 
iterations, since (4.11) is a good upper bound for ||r n || (see Section 8 for examples). 

The residual vector itself can be easily updated at the expense of one additional 
SAXPY per iteration, based on the recursion given in part b) of the following 

Proposition 4.1. For n = 1,2,... : 
a) 

||r„|| < || r 0 1| Vn + 1 |sis 2 • • • s n -is n \ m ax +] («i/wj)i (4.11) 


b) 


i i2 i 

r„ = p n I r„_i 4 u n +i- 

^n-\ 1 


(4.12) 


Proof. By taking norms in (3.8) and with (3.13), we obtain 

|M| < ||V' (n+1) || • ||(S2 < " l r I H • I’Wil- 


(4.13) 


Now, from (2.10) and (2.11), V ( " +1) has n + 1 columns of Euclidean norm 1, and this 
implies 

||l/( n+1) || < Vn+ 1. ( 4 - 14 ) 
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Furthermore, by (3.7), 


(4.15) 


||(n < ” ) r , || < . max (1/wj). 

Finally, by (4.7), 

|f n +i| = |fi| • |si$2 ' • ’ s n-lSn| , (4-16) 

where, in view of (3.12), (3.8), and (3.2), 

fi«IM|wi. ( 4 - 17 ) 

By combining (4.13-17), one obtains the inequality (4.11). 

Now we turn to paxt b). By inserting z — from (3.12) in (3.8) and using (3.11), 
one obtains 

r n = (4.18) 


where 


yn + 1 = V (R+1) (fl (n) ) 1 (Q {n) ) H 


O' 

0 

1 


From (4.1), one readily verifies that two successive vectors y n + i a^nd y„ in (4.18) are 
connected by 

yn +1 = -s n y n + ~ ^n+l ■ (4-19) 

w„+i 

Finally, by inserting (4.19) in (4.18) and using the second relation in (4.7), we arrive at 
(4.12). □ 
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5. The Connection Between QMR and BCG 

In this section, we are concerned with the connection between QMR and BCG. In partic- 
ular, it is shown that BCG iterates can be easily recovered from the QMR process. 

In the BCG approach, one aims at computing iterates x n which are characterized by 
the Galerkin type condition 

w T (b — Ax n ) = 0 for all w G K n (ro,A T ), x n G x 0 + 7v„(r 0 , A). (5.1) 

(see, e.g., [19]). Here, fo G is any nonzero vector. Usually, one sets fo = r o. In the 
classical BCG algorithm [13, 4], the iterates (5.1) are generated as follows. 

Algorithm 5.1 (BCG). 

0) Choose x 0 G C N and set q 0 = ro = b — Aro; 

Choose ro G C N , ro ^ 0, and set qo = ro; 

For n = 1,2,... : 

1) Compute 6 n = r„_ 1 r n _ 1 /qT_ 1 Aq n _ 1 and set x n = i„_i -f S n q n -i; 

Set r n = r n _i - 8 n Aq n - 1 and f„ = f n -i - 8 n A T q n -i; 

2) Compute p n = r?r n /r^r^; 

Set q n — r n -f- p n q n —\ and q n 1 r n -h PnQn—i! 

3) If r n — 0 or f n = 0 , stop. 

BCG is closely related to the classical nonsymmetric Lanczos algorithm. Indeed (see 
e.g. [19]), for n = 1,2,..., 

Tn-l = 4>nV n , <t>n € C, <f> 7 ^ 0, and fn - 1 = TpnWn, G C, Ip f 0, (5.2) 

where v n and w n denote the vectors generated by the standard Lanczos process with 
starting vectors 

v\ = T*o/||r 0 1| and wi = fo/||fo||- (5-3) 

Unfortunately, like the Lanczos algorithm, BCG is also susceptible to breakdowns and 
numerical instabilities. Obviously, Algorithm 5.1 breaks down prematurely, if 

Qn-lQn-l ~ 9 , fn — 1 7 ^ 0, r n — 1 ^ 0, (5-4) 

or 

f^_ir„_! = 0, f„_i 7 ^ 0, r„_! ^ 0, (5.5) 

occurs. We will refer to (5.4) and (5.5) as breakdown of the first and second kind, respec- 
tively. In general, Galerkin iterates (5.1) need not exists for every n. This is the cause 
of the breakdown of the first kind. Indeed, one can show that (5.4) occurs if no BCG 
iterate x n exists. Breakdowns of the second kind have a different cause: by (5.2), (5.5) is 
equivalent to a serious breakdown in the classical nonsymmetric Lanczos process. 
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Next, we rewrite the Galerkin condition (5.1) in terms of the look-ahead Lanczos 
Algorithm 2.1, started with the initial vectors (5.3). This yields a formulation of the BCG 
approach for which breakdowns of the second kind, except for ones caused by an incurable 
breakdown in the look-ahead Lanczos process, cannot occur. In analogy to (3.3), we use 
the parametrization 

x n =x 0 + V (n V n) , u w ec n , (5.6) 

for the BCG iterates. Then, by (2.13), the corresponding residual vector satisfies 

r n = b - Ax n = V (n) (7 (n) - i? (n) u (n) ) - («<">) vn+u 

V V n (5.7) 

with — [pa 0 0] T G R n . 

By inserting (5.7) in (5.1) and using (2.12), it follows that the iterate (5.6) satisfies (5.1) 
if, and only if, 


^PT (n) ) T y ( " ) (/ (n) -LT (n) u (n) ) = (u (n) ) (kF (n) ) T 6„-(-i. (5.8) 

To simplify the discussion of (5.8), we will attempt to recover the BCG iterate only when 
the current block k in Algorithm 2.1 is complete. Therefore, in the sequel, it is always 
assumed that n = njt + i - 1. This ensures that, in view of (2.5-6), the linear system (5.7) 
reduces to 

J?( n V n) = /<">, (5.9) 

from which we can now derive a simple criterion for the existence of the nth BCG iterate. 

In the following proposition, superscripts BCG and QMR are used to distinguish it- 
erates and residuals of the BCG and QMR approaches. The remaining notation is as 
introduced in Sections 3 and 4. 

Proposition 5.2. Let n = n fc+ i - 1, Jb = 0,1,. ... Then, the following three conditions 
are equivalent: 

(i) the BCG iterate x defined by (5.1) exists; 

(ii) is nonsingular; 

(Hi) c„ t^O. 

Moreover, if x^ 00 exists, then 


BCG QMR , ' r «l Sn | 2 

x„ — -r 2 Pm 

c n 


(5.10) 



(5.11) 

Cn 
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Proof. Clearly, an nth BCG iterate exists if, and only if, the linear system (5.9) has a 
solution. From (5.7) and (2.14-16), the extended coefficient matrix [/ (n ^ H ^ ] of (5.9) 
is an upper Hessenberg matrix whose subdiagonal elements are all nonzero, and thus it 

has full row rank n. Consequently, (5.9) has a solution if, and only if, H (n ^ is nonsingular. 
This shows the equivalence of (i) and (ii). 

Next, using (3.11), (3.4), and (4.1), one readily verifies that 


Q(n-l)^(n-l) H (n) 


In - 1 
0 



(5.12) 


This relation implies that (ii) and (iii) are equivalent. 

Now assume c n ^ 0. From (5.9) and (5.12) it follows that 


u 


n) = (R (n) ) 


-l 


In - 1 

0 


0 

l/Cn 


Q(n-l)Q{n-l) f(n) 


(5.13) 


Recalling the definitions of d {n) and / (n) in (3.8) and (5.7), and using (3.12), we can rewrite 
(5.13) as follows: 


u 


» = z (n) + (r m ) 1 [ , 0 

V / T n j Cn T n 


(5.14) 


By comparing (5.6) and (5.14) with (3.3) and (3.12), and by using (4.8), we obtain the 
relation 


Pn 


z™ = xf“ + - T„). 

which, by (4.7), is just (5.10). By inserting (5.9) in (5.7), it follows that 

rf° = -(*<”>) _e. +1 . 

From (5.15), (5.14), and (3.12), we obtain 

where fi n = (R* n ^) . (5.16) 


(5.15) 


„BCG I 


Vn + 1 | 

T n 

Cn 

Pn 


In view of (4.6), 


||5n+l|| = Pn+l = 


w„+l 


(5.17) 


Then, by inserting (5.17), (4.16), and (4.17) in (5.16), we get (5.11), and this concludes 
the proof. □ 
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Remark 5.S. Proposition 5.2 shows that existing BCG iterates can be recovered easily 

from the QMR process. By (5.11), ||r^ CG || can be computed at no extra cost from quantities 
which are generated in the QMR Algorithm 3.1 anyway. In particular, one may monitor 

||r^ CG || during the course of the QMR iteration, and compute x^ 00 via (5.10) whenever 
the actual BCG iterate is desired. 

Remark 5-4- CGS [22] and Bi-CGSTAB [24] are modifications of the BCG Algorithm 
5.1. In many cases, these algorithms have better convergence properties than BCG. How- 
ever, neither CGS nor Bi-CGSTAB addresses the problem of breakdowns. Indeed, one can 
show that, in exact arithmetic, CGS as well as Bi-CGSTAB break down every time BCG 
does. Numerical examples for that will be given in Section 8. 
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6. A Convergence Theorem 

In this section, we derive bounds for the QMR residuals which are essentially the same as 
the standard bounds for GMRES. To the best of our knowledge, this is the first convergence 
result for a BCG-like algorithm for general non-Hermitian matrices. 

Let m denote the termination index of the look-ahead Lanczos Algorithm 2.1, as 
introduced in Proposition 2.2. We remark that, in exact arithmetic, the QMR Algorithm 
3.1 will also terminate in step n = m. For a diagonalizable matrix A/, we denote by 

k(M)= min Ill'll * ll^ _1 II 

X : X 1 MX diagonal 


the condition number of the eigenvalue problem of M. Furthermore, we denote by II n the 
set of all complex polynomials of degree at most n. 

With these notations, the main result of this section can be formulated as follows. 

Theorem 6.1. Suppose that the m x m matrix generated by m steps of the look- 

ahead Lanczos Algorithm 2.1 is diagomdizable, and set 


H = (6.1) 

Then, for n = 1,2, . . . ,m — 1, the residual vectors of the QMR Algorithm 3.1 satisfy 

||r n || < ||r 0 ||k(J7) y/n + 1 e (n) . max (u>i/wj), (6.2) 

• j=l,...,n+l 

where 

= min max |P(A)|. (6-3) 

Pen„: P(0) = 1 AeA ^> 

Moreover, if Algorithm 2.1 terminates with p m + 1 = 0, then x m = A 1 b is the exact 
solution of Ax = b. 

Proof. Using (4.13-15), (3.13), (3.8-9), and (3.2), one readily verifies that 

||r„|| < ||r 0 || \/n + lti n . max (wi/wy). 


where t? n is given by 


•d n = min 
z 6 C" 


> +1) - fi( n )p. (n >z 


.("+») _ 


= [« 


i] T e 


R 


71+ 1 


(6.4) 


Therefore, for the proof of (6.2), it remains to show that 

tin < *(H) £ {n) . 


(6.5) 
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In the following, let n G {1, 2, . . . , m — 1} be arbitrary, but fixed. By 


.ff ( m ) 


H, <*> »' 
0 * 


and (6.1), we have 


z 


'a( n >#i n) (fl (n - 1) )-V 

0 


0 


for all z EC n . 


( 6 . 6 ) 


Recall that H^ m \ and therefore also H, is an upper Hessenberg matrix with nonzero 
subdiagonal elements. This implies that 


g c” ) = P en„.,) 


(6.7) 


Using (6.6-7), we can rewrite (6.4) as follows: 


i? n = min 

zec n 


e\ m) - H 


min I|p(i7)ei m ^ 

pg n n : p(o) = i«' 


( 6 . 8 ) 


is assumed diagonalizable, so by (6.1) H is also diagonalizable, and by expanding 
into any set of eigenvectors of H , we deduce from (6.8) that 


i9 n < k(H ) min max |P(A)|. 

P G II n : P(0) = 1 ^A(H) 1 


(6.9) 


By (6.1) and (2.17), we have A (H) - A (P (m) ) = A(A), and thus (6.9) is equivalent to the 
desired inequality (6.5). 

Finally, we need to show that x m = A -1 6, if Algorithm 2.1 terminates with p m + 1 = 0. 
For n = m and p m + 1 = 0, the least squares problem (3.9) reduces to a linear system with 

coefficient matrix Since A is nonsingular, by (2.17), this linear system is 

nonsingular, and hence it can be solved exactly. Therefore, r m = 0 and this concludes the 
proof. □ 

Recall (cf. Proposition 2.2) that, in exact arithmetic, it can also happen that the QMR 
algorithm terminates with p m + i ^ 0. In this case, one restarts the QMR method, using 
the last available QMR iterate as the new initial guess. Theorem 6.1 shows that x m -i 
is a good choice. However, the finite termination property of the look-ahead Lanczos 
Algorithm 2.1 is usually lost in finite precision arithmetic. In particular, situations where 
the QMR algorithm needs to be restarted are very rare in practice. 
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We remark that for the “natural” scaling^- = 1, the bound (6.2) simplifies somewhat. 
Next, we contrast the bounds (6.2) for QMR with the standard bounds [21] for GM- 

RES. Assume that A is a diagonalizable matrix. Then, the residuals r % MRES generated by 
the GMRES algorithm (without restarts) satisfy 


| r GMRES|| < ||r 0 ||/c(A)£ (n) , n = 1,2,... , 


where, as before, is given by (6.3). Hence, up to the slow growing factor \fn + 1 in 
(6.2) and different constants, the error bounds for QMR and GMRES are essentially the 
same. 

In general, simple upper bounds for (6.3) are known only for special cases. For ex- 
ample, assume that the eigenvalues are contained in an ellipse in the complex plane which 
does not contain the origin: 

A(A) C S, 0 £ £. 

Let /i ^ f “2 denote the two foci of £. The ellipse can be represented in the form 


£ = 


A G C 


1^ — /i I + 1^ — / 2 I 


< I/1-/2I 
2 



with r > 1. 


Moreover, let R be the unique solution of 


1 

2 



\h\±\ M 

I/X-/2I ’ 


R> 1. 


It can be shown that 0 £ 5 implies R > r. Then, by [3, Theorem 2], we have the following 
upper bound for (6.3): 

£ <n) < r" + 1/r" 


R n + 1/R ’ 


-, n = 1,2,... . 
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7. Preconditioning 

As for other conjugate gradient type methods, when solving realistic problems, it is crucial 
to combine the QMR algorithm with an efficient preconditioning technique. In this section, 
we show how to incorporate preconditioners into the QMR algorithm. Also, we briefly 
describe two preconditioning techniques. 

Let M be a given nonsingular N x N matrix which approximates in some sense the 
coefficient matrix A of the linear system (1.1), Ax = b. Moreover, assume that M is 
decomposed in the form 

M = Mi M 2 • ( t • 1 ) 

Instead of solving the original system (1.1), we apply the QMR algorithm to the equivalent 
linear system 

A'y = b', where A! — M 1 -1 AM^ -1 , b' — M 1 1 (b — Axo), y ~ M?(x — xo). 

Here x 0 denotes some initial guess for the solution of Ax = b. The iterates y„ and residual 
vectors T r n — b 1 — A'y n for the preconditioned system (7.1) are transformed back into the 
corresponding quantities for the original system by setting 

x n = x 0 + M 2 -1 y„ and r n = M\r' n . (7.2) 

For the special cases Mi = I or M 2 — I in (7.1) one obtains right or left preconditioning, 
respectively. 

Using (7.2), the QMR Algorithm 3.1 combined with preconditioning can be sketched 
as follows: 

Algorithm 7.1 (QMR approach with preconditioning). 

0) Choose xq € C N and set r' 0 = M 1 1 (b — Axo), po = H^oll, v\ = r(,/ po, yo — 0; 
Choose wi £ C N with ||i«i|| = 1; 

For n = 1,2,...: f 

J ^ Perform the nth iteration of the look-ahead Lanczos Algorithm 2.1 ( applied to A ), 

This yields matrices V^ n \ y( n+1 ), which satisfy A'V^ — V^ n+1 ^H e \ 

2) Update the QR factorization (3.11) of f2 (n) J7 e (n) and the vector t (n) in (3.12); 

3) Compute y„ = 

4) If y n has converged, compute x n = xo + M 2 1 y n , and stop. 

In the case of right or left preconditioning, Algorithm 7.1 simplifies somewhat. In 
general, however, for the QMR algorithm applied to a preconditioned system, one has to 

be able to compute Mj -1 z, M(~ T z, M 2 1 2 , and M 2 T z , for arbitrary vectors z. 

For the numerical examples in Section 8, we have used SSOR and ILUT(k) precondi- 
tioning. 
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• SSOR 

The SSOR preconditioner is based on a decomposition of the matrix A into a non- 
singular diagonal matrix D , a strictly lower triangular matrix L, and a strictly upper 
triangular matrix U, such that A — D L U . D might have to be block diagonal to 
ensure it is nonsingular. The preconditioner matrix M is given by 

M = (D -f L)D~ 1 (D + U). 

For our experiments, we have used SSOR as a right or left preconditioner. 

• ILUT(fc) 

The Incomplete LU decomposition is based on the LU decomposition of the coefficient 
matrix A into a unit lower triangular matrix L and an upper triangular matrix U . The full 
LU decomposition of A would result in factors L and U which, in general, have far more 
nonzero elements than A. The incomplete LU factorization aims to reduce this additional 
fill-in in the factors L and U . 

In ILUT(fc), we use a strategy due to Saad [20] for dropping nonzero elements which 
would fill-in L and U . Each row of L and U is constructed subject to the restriction that 
only a small amount of fill-in, k more elements for each, is allowed beyond the number 
of elements of A already present in that row (in the lower and upper part, respectively). 
Furthermore, elements which Eire deemed to make only an insignificant contribution to the 
decomposition are also dropped. For example, this means that if nmaxi is the maximum 
number of elements allowed for some row of L, til is the actual number of elements of 
that row computed by the elimination process, and ctol is the cutoff tolerance, then the 
algorithm orders the til elements in decreasing order of magnitude, and keeps only up 
to min (til, nmaxi) elements, or until the elements reach the level ctol , whichever cutoff 
comes first. The resulting matrices L and U can be used either as M\ = L and M 2 = U 
in (7.1), or in M 2 = LU respectively M\ = LU for right respectively left preconditioning. 

The variant of ILU used is different from the standard one. For a Hermitian matrix 
A , the standard ILU preconditioner [15] preserves the sparsity structure of the matrix, i.e., 
for k = 0, the preconditioner matrices have nonzero elements only in those locations where 
A itself has nonzero elements. In [15] it is shown that this strategy does produce a good 
preconditioner, provided that A is a Hermitian Af-matrix. For a general non-Hermitian 
matrix, there is no reason to preserve the sparsity structure of A. Instead, the ILUT(k) 
variant discards elements subject only to the constraints of fill-in and size, without regard 
to the sparsity structure of A. However, this does mean that if A is Hermitian, we do not 
recover the standard ILU preconditioner. 
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8. Numerical Experiments 

We have performed extensive numerical experiments with the QMR algorithm and the 
other iterative methods mentioned in this paper. In this section, we present a few typical 
results of these experiments. Further numerical results are reported in [7] and, for the case 
of complex symmetric matrices, in [5]. 

For our test runs, we always chose xo = 0 as initial guess. The iteration was stopped 
as soon as the convergence criterion (4.10) (with tol — 10 -6 ) was satisfied. We always 

used the QMR algorithm with no scaling, i.e., = I n +i in (3.7). For GMRES [21], 

work and storage per iteration step n grows linearly with n and hence one needs to use 
restarts. In the sequel, GMRES(no) refers to GMRES restarted after every no iterations. 
A typical value for the restart parameter is no = 20; work and storage per iteration are 
then roughly comparable for GMRES and QMR. Furthermore, unless stated otherwise, the 
BCG iterates were always obtained from the QMR process via (5.10), rather than from the 
BCG Algorithm 5.1. Examples 1, 3, and 4 were run using right preconditioning. We also 
tried left preconditioning, and in all cases, the number of iterations needed to converge was 
roughly the same. In all our experiments with QMR, the look-ahead Lanczos Algorithm 
2.1 performed almost exclusively regular steps and only few blocks of size h\ > 1 were 
built. The biggest block that occurred was of size hi = 4. For all examples, we report 
the exact numbers of blocks of size 2, 3, and 4. Finally, the convergence plots show the 
relative residual norms ||r n ||/||ro|| (on the vertical axis) versus the iteration number n (on 
the horizontal axis). 

All examples were run on a Cray-2 at the NASA Ames Research Center. 

Example 1 . We consider the partial differential equation 


— Au + 7 + y~Q^ ■+■ z ~Q~ S j 4" P u — f on (0» 1) x (0) 1) x (0) 1)) (8.1) 

with Dirichlet boundary conditions. We discretize (8.1) using centered differences on a 
uniform 25 x 25 x 25 grid with mesh size h = 1/26. The resulting linear system has 
a sparse coefficient matrix A of order N — 15625 with 105625 nonzero elements. For 
our experiments, we have chosen the parameters /? = —250 and 7 = 40. Note that this 
choice guarantees that the cell Reynolds number is smaller than one, and hence centered 
differences yield a stable discretization of (8.1). The right-hand side was chosen such 
that the vector of all ones is the exact solution of the linear system. QMR was run with 
identical starting vectors v\ = w\. This example was run with SSOR preconditioning. 
In Figure 1, we show the convergence curves for QMR (solid line), BCG (dotted line), 
GMRES(19) (dash-dotted line), and GMRES(20) (dashed line). As the plot indicates, the 
convergence curve for QMR is rather smooth; we also see the typical oscillations in the 
BCG convergence curve. Also, we see that GMRES, while being optimal as long as it is 
not restarted, loses its edge once it is restarted, and furthermore its behavior after being 
restarted can be quite sensitive to the length n 0 of the restart cycle. Finally, we note that 
in the course of the QMR run, 4 blocks of size 2 were built. 
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Figure 1. Convergence curves for Example 1, right SSOR. 


Example 2. This example was taken from the Harwell- Boeing set of sparse test matrices 
[1]. It is the fifth matrix from the SHERMAN collection, called SHERMAN 5. It comes 
from a fully implicit black oil simulator on a 16 x 23 x 3 grid, with three unknowns per grid 
point. The order of the matrix is 3312, and it has 20793 nonzero elements. The right-hand 
side b was chosen as a vector with random entries from a normal distribution with mean 
0.0 and variance 1.0. Again, QMR was run with identical starting vectors v\ = w\. 

If no preconditioning is used, this linear system is very difficult to solve by iterative 
methods. As Figure 2 shows, QMR needs almost 1500 iterations to converge. On the 
other hand, GMRES(IOO) (dash-dotted line), even with an unrealistically large restart 
value no = 100, does not converge within a reasonable number of iterations. Figure 2 also 
shows the relative residual norms (dots) of the BCG iterates as obtained from the QMR 
algorithm. During the QMR run, the underlying look-ahead Lanczos algorithm built 49 
blocks of size 2, 7 blocks of size 3, and 1 block of size 4. 

Next, we ran the SHERMAN 5 problem with SSOR and ILUT(0) preconditioning 
(cf. Section 7). For ILUT(0) the cutoff tolerance ctol = .001 was chosen; this choice 
leads to approximate factors L and U which together have 19899 nonzero elements. In 
Figure 3 (for SSOR) and Figure 4 (for ILUT(0)), we show the convergence curves for QMR 
(solid line), BCG (dotted line), and GMRES(20) (dash-dotted line). We have also plotted 
(dashed line) the upper bound (4.11) for the QMR residuals. Clearly, these bounds are very 
close to the true QMR residuals. In Figure 4, we have also added the convergence curve 
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(lower dash-dotted line) for a full GMRES run without restart to the one for GMRES(20). 
A comparison of this curve with the QMR curve shows that the quasi-minimal residual 
property is very close to the true minimal residual property of GMRES. Finally, we note 
that, for both SSOR and ILTJT(O), only blocks of size 1 were built during the QMR run. 
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Examples 3 and 4. Next, we present two examples which were constructed such that 
a breakdown of the first kind (Example 3) respectively of the second kind (Example 4) 
occurs (cf. Section 5). In both cases the matrix A is of order N = 20. For Example 3, 
the matrix H$ generated by 5 steps of the Lanczos process is singular and thus, in view 

of Proposition 5.2, xf 00 does not exist. Consequently, BCG, CGS, and Bi-CGSTAB all 
break down in step 5 (cf. Remark 5.4). Example 4 was constructed such that an exact 
breakdown occurs in step 2 of the standard Lanczos algorithm. This then leads to a 
breakdown of BCG, CGS, and Bi-CGSTAB in step 3. In Figure 5 and Figure 6, we show 
the convergence curves for QMR (solid line), BCG (dotted line), CGS (dashed line), and 
Bi-CGSTAB (solid line becoming vertical at step 5 respectively 3). In both cases, the BCG 
iterates were computed by the BCG Algorithm 5.1 before the breakdown occurs and via 
the QMR process after the breakdown. Finally, we note that, for both examples, 2 blocks 
of size 2 were built during the QMR run. 



Figure 5. Convergence curves for Example 3. 
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9. Concluding Remarks 

In this paper, we have proposed a robust iterative solver, the QMR algorithm, for general 
nonsingular non-Hermitian linear systems. The method uses a recently proposed [6, 7] 
robust implementation of the look-ahead Lanczos algorithm to generate basis vectors for 
the Krylov subspaces -K n (ro, A). The QMR iterates are characterized by a quasi-minimal 
residual property over I\ n (ro, A). Both the look-ahead Lanczos algorithm and the compu- 
tation of the actual QMR iterates can be implemented using only short recurrences. The 
QMR approach is closely related to the BCG algorithm; however, unlike BCG, the QMR 
algorithm has smooth convergence curves and good numerical properties. Furthermore, we 
have derived bounds for the QMR residuals which are essentially the same as the standard 
bounds for GMRES. To the best of our knowledge, this is the first convergence result for 
a BCG-like algorithm for general non-Hermitian matrices. 

FORTRAN 77 implementations of the QMR method and the underlying look-ahead 
Lanczos algorithm are available electronically from the authors (na.freund@na-net.ornl.gov 
or na.nachtigal@na-net.ornl.gov). 
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